Should you believe that this coin is fair? 
Faced with a sequence of N binary events, such as coin flips (or Ising spins), it is natural to ask 
whether these events reflect some underlying dynamic signals or are just random. Plausible models 
for the dynamics of hidden biases lead to surprisingly high probabilities of misidentifying random 
sequences as biased. In particular, this probability decays as $N^{-1/4}$, so that no reasonable amount of 
data would be sufficient to induce the concept of a fair coin with high probability. I suggest that these 
theoretical results may be relevant to understanding experiments on the apparent misperception of 
random sequences by human observers. 



There is a large literature testifying to the errors that humans make in reasoning about
probability [?]. Perhaps most fundamental is the claim that people routinely detect order and
hidden causes in genuinely random sequences [?]. These apparent limitations on human rationality
have broad implications, not least for economics, and have attracted considerable attention in
the popular press. In contrast with these results, a number of experiments indicate that humans
and other animals can change their behavior in response to changes in the probabilities of
stimuli and rewards, sometimes making optimal use of the available data [?]. Similarly, many
perceptual discriminations approach the limits to reliability set by noise near the sensory
input [?], and related ideas of statistical optimization have emerged in recent work on motor
control [?]. There is even the suggestion that if the detection of order vs. randomness is cast
in the standard two-alternative format for perceptual discrimination experiments, then people
can learn to perform with close to the statistically maximum reliability [14]. While nearly
perfect neural processing of statistical data under some conditions could coexist with
qualitative failures at similar problems under different conditions, it would be attractive to
have a more unified view of the brain as an engine for probabilistic inference.

The problem of identifying genuinely random se- 
quences has a number of subtleties that seem not to have 
been emphasized in the previous literature. In particu- 
lar, from a Bayesian point of view our confidence that a 
given sequence really was generated at random depends 
entirely on the universe of alternative models that we are 
willing to consider. Here I consider a family of models 
that involve time-dependent biases of unknown magni- 
tude, similar in spirit to the models that would allow op- 
timal inference under the conditions of the experiments 
in Refs [?], and show that an observer who tries to 
understand the world using these models will exhibit a 
surprisingly large probability of misidentifying random 
sequences as having small but nonzero biases. Most im- 
portantly this probability declines only as a fractional 
power of the sequence length, so that (for example) no 
reasonable human experience with the flipping of a coin 
would provide sufficient evidence to induce the concept 
of fairness with high probability. 

To be concrete let us consider data in the form of binary sequences: coin flips (heads/tails),
for example, or a sequence of rewards/nonrewards from a particular class of actions. Let the
observations come in a sequence labeled by $n = 1, 2, \ldots, N$, and let the binary variable on
each observation $n$ be denoted by $\sigma_n$. If the process is described by a fair coin then
all sequences occur with equal probability,

$$P(\{\sigma_n\}|\text{fair coin}) = \frac{1}{2^N}. \tag{1}$$

The problem faced by the observer, however, is to estimate the probability that this particular
sequence was generated by a fair coin, that is $P(\text{fair coin}|\{\sigma_n\})$. Bayes' rule
tells us that we can write this probability as

$$P(\text{fair coin}|\{\sigma_n\}) = \frac{P(\{\sigma_n\}|\text{fair coin})\,P(\text{fair coin})}{P(\{\sigma_n\})}, \tag{2}$$

where $P(\text{fair coin})$ measures the a priori probability of a fair coin and
$P(\{\sigma_n\})$ measures the probability that this particular sequence will arise, averaged
over all possible models that might describe it; schematically we can write

$$P(\{\sigma_n\}) = \sum_{\text{models}} P(\{\sigma_n\}|\text{model})\,P(\text{model}), \tag{3}$$

and of course one of the possible models is the fair coin. Because the fair coin generates all
sequences with equal probability [Eq (1)], the probability that a fair coin generated a given
sequence depends on the details of the sequence only through the term $P(\{\sigma_n\})$, i.e.
only through the average over all possible models of the sequence. Thus our confidence that our
experience is described by a fair coin depends entirely on the set of alternative models
that we think is appropriate to the situation.
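
The dependence on the set of alternatives can be illustrated with a minimal sketch. The function name `posterior_fair`, the fixed-bias alternatives and the equal split of prior mass among them are assumptions made for illustration only; they are simpler than the time-dependent biases introduced in the text.

```python
def posterior_fair(n_heads, n_total, alt_biases, prior_fair=0.5):
    """P(fair coin | data) via Eqs (2)-(3) for a hypothetical set of alternatives.

    alt_biases: head-probabilities q of fixed-bias coins (an illustrative
    assumption); the prior mass 1 - prior_fair is split equally among them.
    """
    n_tails = n_total - n_heads
    like_fair = 0.5 ** n_total                      # Eq (1): every sequence equally likely
    prior_alt = (1.0 - prior_fair) / len(alt_biases)
    evidence = like_fair * prior_fair               # fair-coin term of Eq (3)
    for q in alt_biases:
        evidence += (q ** n_heads) * ((1.0 - q) ** n_tails) * prior_alt
    return like_fair * prior_fair / evidence        # Eq (2)


# Same data (12 heads in 20 flips), two different universes of alternatives.
narrow = posterior_fair(12, 20, alt_biases=[0.9])
broad = posterior_fair(12, 20, alt_biases=[0.55, 0.60, 0.65])
```

With a single strongly biased alternative the fair coin is strongly favored, while with weakly biased alternatives the same data leave the fair coin less probable than not.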

A plausible set of alternatives to the fair coin is that each $\sigma_n$ responds independently
to a bias, but this bias may change over time. To specify this class of models completely
requires at least three steps. First we need to describe the bias on each trial. To treat heads
and tails symmetrically we can represent heads/tails as an Ising spin $\sigma_n = \pm 1$, and
measure the bias on observation $n$ as an effective magnetic field $h_n$ such that [15]

$$P(\sigma_n|h_n) = \frac{\exp(h_n\sigma_n)}{2\cosh(h_n)}. \tag{4}$$

Assuming that the bias acts on each observation independently, we have

$$P(\{\sigma_n\}|\{h_n\}) = \prod_{n=1}^{N}\frac{\exp(h_n\sigma_n)}{2\cosh(h_n)} = \frac{1}{2^N}\exp\left[-f(\{\sigma_n; h_n\})\right], \tag{5}$$

where

$$f(\{\sigma_n; h_n\}) = \sum_{n=1}^{N}\left[\ln\cosh(h_n) - h_n\sigma_n\right]. \tag{6}$$
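
As a small numerical sketch of the independent-bias likelihood, the following assumes the form $P(\{\sigma_n\}|\{h_n\}) = 2^{-N}\exp(-f)$ with $f = \sum_n[\ln\cosh(h_n) - h_n\sigma_n]$. The function names and the example sequence are hypothetical.

```python
import numpy as np


def f_energy(sigma, h):
    """f({sigma_n; h_n}) = sum_n [ln cosh(h_n) - h_n * sigma_n]."""
    return np.sum(np.log(np.cosh(h)) - h * sigma)


def log_prob_given_bias(sigma, h):
    """ln P({sigma_n}|{h_n}) = -N ln 2 - f, for spins sigma_n = +/-1."""
    return -len(sigma) * np.log(2.0) - f_energy(sigma, h)


# Illustrative sequence with six heads (+1) and two tails (-1).
sigma = np.array([1, 1, -1, 1, -1, 1, 1, 1])
zero_bias = log_prob_given_bias(sigma, np.zeros(8))      # reduces to the fair coin
head_bias = log_prob_given_bias(sigma, np.full(8, 0.3))  # constant bias toward heads
```

With all $h_n = 0$ the result is $-N\ln 2$, the fair-coin value; a bias toward heads raises the likelihood of this head-rich sequence.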



A particular set of biases $\{h_n\}$ constitutes one possible model. The second step is that to
average over models we will need a hypothesis about the distribution of these biases. Consider
the case where the bias is on average zero (coins or circumstances that favor heads are as
likely as those that favor tails), and where the presumably small fluctuations in bias are drawn
from a Gaussian with root-mean-square fluctuations $h_{\rm rms}$. Dynamics of the bias vs. time
are described by a correlation function $C(\tau)$, $\langle h_n h_m\rangle = h_{\rm rms}^2 C(n-m)$.
If, for example, the bias tends to stay constant for runs of 10 observations, then $C(\tau)$
should be close to 1 for $|\tau| < 10$ and fall to zero for $|\tau| \gg 10$. It is convenient to
think about a matrix $\hat{C}$ with elements defined by $(\hat{C})_{nm} = C(n-m)$. Then the
full distribution of biases has the form 

$$P(\{h_n\}|h_{\rm rms}) = \frac{1}{\sqrt{(2\pi h_{\rm rms}^2)^N\det\hat{C}}}\exp\left[-\frac{1}{2h_{\rm rms}^2}\sum_{n,m=1}^{N}h_n(\hat{C}^{-1})_{nm}h_m\right]. \tag{7}$$

As a third and final step we have to specify our knowledge of $\hat{C}$ and $h_{\rm rms}$. To
keep things simple let us assume that there is something about the situation which makes us
certain about the time scales on which bias can vary, so that we know the correlation function
$C(\tau)$, but we don't know for sure how strong the biases can get, so we have
to average over some distribution $P(h_{\rm rms})$.
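
The generative side of this model can be sketched by drawing correlated Gaussian biases and then flipping a biased coin on each trial. The exponential correlation function, the function names and the parameter values are assumptions chosen for illustration.

```python
import numpy as np


def sample_biases(n_obs, h_rms, tau, rng):
    """Draw {h_n} from a zero-mean Gaussian with <h_n h_m> = h_rms^2 * C(n - m).

    C(n - m) = exp(-|n - m| / tau) is an assumed form of the correlation function.
    """
    idx = np.arange(n_obs)
    corr = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)
    return rng.multivariate_normal(np.zeros(n_obs), h_rms**2 * corr)


def sample_sequence(h, rng):
    """Draw spins sigma_n = +/-1 with P(sigma_n = +1 | h_n) = exp(h_n) / (2 cosh h_n)."""
    p_heads = np.exp(h) / (2.0 * np.cosh(h))
    return np.where(rng.random(h.size) < p_heads, 1, -1)


rng = np.random.default_rng(0)
h = sample_biases(n_obs=50, h_rms=0.2, tau=10.0, rng=rng)
sigma = sample_sequence(h, rng)
```

Setting `h_rms` to zero recovers the fair coin, so the same two functions generate both hypotheses.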

Putting the various terms together, we can write the probability of observing a sequence
$\{\sigma_n\}$ in the broad family of biased models as

$$P(\{\sigma_n\}|\text{biased}) = \int dh_{\rm rms}\int d^N h\, P(\{\sigma_n\}|\{h_n\})\,P(\{h_n\}|h_{\rm rms})\,P(h_{\rm rms}), \tag{8}$$

and then

$$P(\{\sigma_n\}) = P(\{\sigma_n\}|\text{fair coin})\,P(\text{fair coin}) + P(\{\sigma_n\}|\text{biased})\,P(\text{biased}). \tag{9}$$


Thus the probability of an observed sequence arising from a fair coin is

$$P(\text{fair coin}|\{\sigma_n\}) = \frac{1}{1 + \exp(\Lambda - \mu)}, \tag{10}$$

where the log-likelihood ratio is

$$\Lambda = \ln\left[P(\{\sigma_n\}|\text{biased})/P(\{\sigma_n\}|\text{fair coin})\right], \tag{11}$$

and the threshold $\mu = \ln[P(\text{fair coin})/P(\text{biased})]$.
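
A minimal sketch of this logistic form, assuming only two hypotheses so that $P(\text{biased}) = 1 - P(\text{fair coin})$; the function name is hypothetical.

```python
import math


def prob_fair(log_likelihood_ratio, prior_fair):
    """P(fair coin | data) = 1 / (1 + exp(Lambda - mu)), as in Eq (10).

    mu = ln[P(fair) / P(biased)], with P(biased) = 1 - P(fair) assumed.
    """
    mu = math.log(prior_fair / (1.0 - prior_fair))
    return 1.0 / (1.0 + math.exp(log_likelihood_ratio - mu))


even_odds = prob_fair(0.0, prior_fair=0.5)    # Lambda = mu = 0
rare_fair = prob_fair(0.0, prior_fair=0.1)    # uninformative data, fair coins rare
```

When the data are uninformative ($\Lambda = 0$) the posterior simply returns the prior, and the posterior crosses one half exactly when $\Lambda = \mu$.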

In the limit that $h_{\rm rms}$ is small we can compute the integral over $\{h_n\}$ in Eq (8)
as a perturbation series. To fourth order in $h_{\rm rms}$, the result is

$$\int d^N h\, P(\{\sigma_n\}|\{h_n\})\,P(\{h_n\}|h_{\rm rms}) \approx \frac{1}{2^N}\exp\left[z\sqrt{N\tau_c}\,h_{\rm rms}^2 - \frac{\alpha}{2}N\tau_c h_{\rm rms}^4\right], \tag{12}$$

where the correlation time is defined by

$$\tau_c = \frac{1}{2N}{\rm Tr}\,\hat{C}^2 = \frac{1}{2}\sum_n C^2(n), \tag{13}$$
$\alpha = 1 + y\sqrt{4g/N\tau_c}$, $g = ({\rm Tr}\,\hat{C}^4)/({\rm Tr}\,\hat{C}^2)$, and the
variables $z$ and $y$ depend on the particular sequence that we observe:
The normalization of $z$ and $y$ is chosen so that they each have zero mean and unit variance if
the $\{\sigma_n\}$ actually are generated by a fair coin. The exponential in Eq (12) sets a
scale for $h_{\rm rms} \sim N^{-1/4}$, which becomes small at large $N$. Thus when we do the
integral over $h_{\rm rms}$ in Eq (8) it will be dominated by very small values of
$h_{\rm rms}$, so the relevant parameter is $P(h_{\rm rms} \to 0) = p$ [16]. Then 



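$$P(\{\sigma_n\}|\text{biased}) \approx \frac{p}{2^N}\int_0^\infty dh_{\rm rms}\,\exp\left[z\sqrt{N\tau_c}\,h_{\rm rms}^2 - \frac{\alpha}{2}N\tau_c h_{\rm rms}^4\right]. \tag{16}$$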
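
The sequence statistic $z$ can be explored numerically. The sketch below assumes that $z$ has the form analogous to Eq (24), namely $z = [\sum_{nm}\sigma_n\hat{C}_{nm}\sigma_m - N]/\sqrt{4N\tau_c}$, and uses an exponential correlation function; both are assumptions for illustration, as are the function name and parameter values.

```python
import numpy as np


def z_statistic(sigma, corr):
    """Assumed form: z = (sigma^T C sigma - N) / sqrt(4 N tau_c), tau_c = Tr C^2 / (2N)."""
    n = sigma.size
    tau_c = np.trace(corr @ corr) / (2.0 * n)
    return (sigma @ corr @ sigma - n) / np.sqrt(4.0 * n * tau_c)


rng = np.random.default_rng(1)
n_obs, tau = 200, 10.0
idx = np.arange(n_obs)
corr = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)  # assumed C(n - m)

# Distribution of z over genuinely random (fair coin) sequences.
z_samples = np.array(
    [z_statistic(rng.choice([-1, 1], size=n_obs), corr) for _ in range(2000)]
)
z_mean, z_var = z_samples.mean(), z_samples.var()
```

Under this assumed form the mean vanishes exactly for fair sequences, and the variance is $1 - N/{\rm Tr}\,\hat{C}^2$, which approaches one when the correlation time is long.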



At large $N$ and $z < 0$ the integral is dominated by its behavior near $h_{\rm rms} = 0$,
while for $z > 0$ it is dominated by a saddle point at $h^* = (z^2/\alpha^2 N\tau_c)^{1/4}$.
When the dust settles, the large $N$ approximation to the log-likelihood 
ratio becomes 



$$\Lambda(z<0) \approx \frac{1}{2}\ln\left[\frac{\pi p^2}{4|z|\sqrt{N\tau_c}}\right], \tag{17}$$

$$\Lambda(z>0) \approx \frac{z^2}{2\alpha} + \frac{1}{2}\ln\left[\frac{\pi p^2}{2z\sqrt{N\tau_c}}\right], \tag{18}$$

where the dependence on the observed sequence is through the value of $z$ (and, negligibly in
this limit, also through the dependence of $\alpha$ on $y$).

From Eq (10) we see that if $\Lambda > \mu$ then the probability of the sequence
$\{\sigma_n\}$ being described by a fair coin is less than half. Equivalently, if
$\Lambda > \mu$ then the data are more likely to be described by a biased model. If we want to
make correct assignments with highest probability, then the correct decision rule is maximum
likelihood [4], so that all sequences with $\Lambda > \mu$ should be labeled as biased.
Obviously there is some probability that this condition is met even if the sequence in fact was
generated at random. Note that for large $N$ the variable $z$, which determines $\Lambda$, is
the sum of many terms which are independent if the sequence really is random. Thus the
distribution of $z$ approaches a Gaussian (with zero mean and unit variance, by construction)
and we can estimate the probability that $\Lambda > \mu$, which is the probability that
a random sequence will be identified as biased [17].

The condition $\Lambda > \mu$ corresponds either to $z_- < z < 0$ or $z > z_+$, where 
Then a sequence generated by a fair coin will be assigned 
as biased with a probability given by 
For large $N$, $|z_-| \ll 1$ and $z_+ \gg 1$, which lets us approximate the integrals to find 
where $r = pP(\text{biased})/P(\text{fair coin})$ and I neglect terms 
of the form $\ln\ln N$. At sufficiently large $N$ the second 
term dominates, and we have simply 

$$P_{\rm error} \approx \left(\frac{N_c}{N}\right)^{1/4}, \tag{23}$$

where the scale is set by $N_c = r^4/(16\tau_c)$.
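
The route from Eq (16) to the $N^{-1/4}$ law can be sketched as follows. This is a reconstruction under the assumption that the integrand in Eq (16) is $\exp[z\sqrt{N\tau_c}\,h_{\rm rms}^2 - (\alpha/2)N\tau_c h_{\rm rms}^4]$ with $\alpha \approx 1$; the intermediate expressions are not quoted from the text and may differ from it in form.

```latex
% Assumed integrand: exp[ z sqrt(N tau_c) h^2 - (alpha/2) N tau_c h^4 ], with alpha ~ 1.
\begin{align*}
\Lambda(z<0) &\approx \tfrac12\ln\!\left[\frac{\pi p^2}{4|z|\sqrt{N\tau_c}}\right], \qquad
\Lambda(z>0) \approx \frac{z^2}{2} + \tfrac12\ln\!\left[\frac{\pi p^2}{2z\sqrt{N\tau_c}}\right],\\
\Lambda = \mu \;&\Rightarrow\; z_- = -\frac{\pi r^2}{4\sqrt{N\tau_c}}, \qquad
z_+^2 = \tfrac12\ln\!\left(\frac{N\tau_c}{r^4}\right) + \ln\!\left(\frac{2z_+}{\pi}\right),\\
P_{\rm error} &= \int_{z_-}^{0}\frac{dz}{\sqrt{2\pi}}\,e^{-z^2/2}
  + \int_{z_+}^{\infty}\frac{dz}{\sqrt{2\pi}}\,e^{-z^2/2}
  \approx \frac{|z_-|}{\sqrt{2\pi}} + \frac{e^{-z_+^2/2}}{\sqrt{2\pi}\,z_+}\\
&= \frac{\sqrt{\pi}\,r^2}{4\sqrt{2N\tau_c}}
  + \frac{1}{2z_+^{3/2}}\left(\frac{r^4}{N\tau_c}\right)^{1/4}.
\end{align*}
```

The first term falls as $N^{-1/2}$ and the second as $N^{-1/4}$, so the second dominates at large $N$. Dropping the factor $z_+^{-3/2}$, which contributes only $\ln\ln N$ corrections to $\ln P_{\rm error}$, the second term is $(N_c/N)^{1/4}$ with $N_c = r^4/(16\tau_c)$.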

The power-law decay of the error probability is surprising if we have in mind the conventional
problem of statistical hypothesis testing. With two specific alternative hypotheses, e.g. the
fair coin and a coin with fixed (and known) bias, the log-likelihood ratio on average grows
linearly with the number of examples that we observe, this growth rate being the
Kullback-Leibler divergence between the two hypothesized distributions, and the resulting error
probability falls exponentially. By allowing for the possibility that biases change with time
over the course of our observations, which seems plausible in many natural contexts, we
construct a family of models with a number of parameters that grows as the size of the data set
increases. In addition, the particular model of bias considered here allows larger data sets to
be used to test for weaker biases (smaller $h_{\rm rms}$), which again is plausible but very
different from the simpler problem of discriminating between two fixed distributions. To
summarize, if we test data for a fixed bias of known size, then the probability of mistakenly
finding order in a random sequence decays exponentially. If on the other hand we test for time
dependent biases of unknown magnitude, the error probability falls as a power-law. The case
considered here generates a 1/4 power, but one can construct
models that generate other powers as well [18].
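
The contrast between the two decay laws can be made concrete with a short sketch. The fixed-bias test below is an exact binomial calculation for a fair coin against a coin with known head-probability `q` at equal priors; the function names, `q = 0.6` and `n_c = 41` are illustrative assumptions.

```python
import math


def false_alarm_fixed_bias(n_obs, q):
    """P(a fair sequence is labeled biased), fair coin vs. known bias q > 1/2.

    With equal priors the log-likelihood ratio is positive when the number of
    heads k exceeds k_star, so we sum the fair-coin binomial tail above it.
    """
    k_star = -n_obs * math.log(2.0 * (1.0 - q)) / math.log(q / (1.0 - q))
    k_min = math.floor(k_star) + 1
    return sum(math.comb(n_obs, k) for k in range(k_min, n_obs + 1)) / 2**n_obs


def false_alarm_power_law(n_obs, n_c):
    """Power-law form (N_c / N)^(1/4) for time-dependent biases of unknown size."""
    return (n_c / n_obs) ** 0.25


for n_obs in (100, 1000):
    fixed = false_alarm_fixed_bias(n_obs, q=0.6)
    dynamic = false_alarm_power_law(n_obs, n_c=41.0)
    print(n_obs, fixed, dynamic)
```

A tenfold increase in data reduces the fixed-bias error by orders of magnitude, while the power-law error shrinks by less than a factor of two.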

In the case of two specific hypotheses, the exponential dependence of error probability also
means that the trading between number of examples and prior probability is only logarithmic.
Suppose that when fair and biased coins are equally likely a priori ($\mu = 0$), it takes $N_0$
examples to reach some criterion level of error. If we now imagine that truly fair coins are
rare, in the conventional hypothesis testing view it will take
$\Delta N_0 \propto \ln[P(\text{biased})/P(\text{fair coin})]$ additional examples to reach the
same level of error. In contrast, when we search for time dependent biases with unknown
magnitude, we are much more sensitive to prior probabilities, since Eq (23) predicts
$N \propto N_c$, or $N_0 \to N_0[P(\text{biased})/P(\text{fair coin})]^4$. Concretely, if we can
expect that genuinely random sequences constitute only $\sim 10\%$ of the events that we will
see, that the typical scale of biases is $h_{\rm rms} \sim 1$ so that $p \sim 1$, and that
correlation times are $\tau_c \sim 10$ events, then $N_c \sim 40$ and hence even at $N = 10^4$
the error probability is $P_{\rm error} \sim 0.25$ [?].
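
The arithmetic of this example can be checked directly, taking $P_{\rm error} \approx (N_c/N)^{1/4}$ with $N_c = r^4/(16\tau_c)$ and $r = pP(\text{biased})/P(\text{fair coin})$. The function names are hypothetical, and the inversion for the required $N$ is an added illustration.

```python
def error_probability(n_obs, p0, prior_fair, tau_c):
    """Return (N_c, P_error) with r = p0 * P(biased) / P(fair), N_c = r^4 / (16 tau_c)."""
    r = p0 * (1.0 - prior_fair) / prior_fair
    n_c = r**4 / (16.0 * tau_c)
    return n_c, (n_c / n_obs) ** 0.25


def n_required(target_error, p0, prior_fair, tau_c):
    """Sequence length at which (N_c / N)^(1/4) falls to target_error."""
    n_c, _ = error_probability(1.0, p0, prior_fair, tau_c)
    return n_c / target_error**4


# Values from the text: 10% fair coins, p ~ 1, tau_c ~ 10 events, N = 10^4.
n_c, p_err = error_probability(1e4, p0=1.0, prior_fair=0.1, tau_c=10.0)
n_for_1_percent = n_required(0.01, p0=1.0, prior_fair=0.1, tau_c=10.0)
```

Because the required length scales as the inverse fourth power of the target error, reaching an error of one percent under these assumptions takes billions of observations.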

The sequences which are identified as biased are primarily those with large positive values of
$z$ from Eq (14). If the expected correlation function of the bias is everywhere positive (so
that we don't expect oscillating biases), then this singles out sequences that have an excess
of runs with multiple heads or tails in a row. Put another way, sequences which are declared to
be random have fewer runs than expected in genuinely random distributions. This is in agreement
with experiments showing that when human subjects are asked to generate random sequences or to
assess sequences for randomness, they behave according to a "representativeness heuristic" [1]
or as if there were a "law of small numbers" [20] according to which short sequences are more
typical of long sequence mean behavior than actually predicted for a fair coin.

All of the analysis done here can be repeated in a model where "bias" is replaced with serial
correlation. Then the natural parameter is not a local magnetic field but a local exchange
interaction $J_n$ between neighboring spins $\sigma_n$ and $\sigma_{n-1}$. The role of $z$
[Eq (14)], which controls the likelihood of a sequence being assigned as random or correlated,
is played by

$$z' = \frac{1}{\sqrt{4N\tau_c}}\left[\sum_{n,m}(\sigma_n\sigma_{n+1})\hat{C}_{nm}(\sigma_m\sigma_{m+1}) - N\right], \tag{24}$$

where $\hat{C}$ is now the correlation matrix and $\tau_c$ is the correlation time for
fluctuations in $J$. Again the sequences identified as biased will be those with large positive
$z'$, corresponding to an excess of either repetitions or alternations. This may provide an even
more accurate description of human perceptual biases toward representativeness in small
samples [2, 19].

The model of dynamic biases considered here is closely related to the experiments in Refs [?]
and [?], where animals experience time dependent biases in reward probability and have to
modulate their behaviors accordingly. In these experiments the possibility of a fair coin is
excluded by construction, so the question is not to determine whether the bias exists but
rather to determine its current value and use this value in decision making [21]. Within the
model above one can show that, for weak biases, the optimal estimate is
$h_n = \sum_{i>0} K(i)\sigma_{n-i}$, where 
the kernel $K$ is determined by 
K(n-j) = C n 



(25) 



Specifically, if $C(\tau) = \exp(-|\tau|/\tau_c)$, then the kernel is also exponential,
$K(\tau) \propto \exp(-\tau/\tau_{\rm int})$, where the integration time is
$\tau_{\rm int} = \tau_c/\sqrt{1 + 2h_{\rm rms}^2\tau_c}$. Thus, when correlation times are
long, even if $h_{\rm rms}$ is not too large the optimal estimator integrates for a time much
shorter than the correlation time; e.g., with correlation times of order 100 events,
$\tau_{\rm int} \sim 10$-$15$ events, in reasonable agreement with the results of Ref [?].
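
A quick numerical sketch of the integration time, taking it to have the form $\tau_{\rm int} = \tau_c/\sqrt{1 + 2h_{\rm rms}^2\tau_c}$; that form, the function name and the sample values of $h_{\rm rms}$ are assumptions here.

```python
import math


def integration_time(tau_c, h_rms):
    """Assumed form: tau_int = tau_c / sqrt(1 + 2 * h_rms^2 * tau_c)."""
    return tau_c / math.sqrt(1.0 + 2.0 * h_rms**2 * tau_c)


# Correlation time of order 100 events and moderate bias strengths.
for h_rms in (0.5, 0.7):
    print(h_rms, integration_time(100.0, h_rms))
```

For these moderate bias strengths the integration time comes out between roughly 10 and 15 events, and it returns to $\tau_c$ as $h_{\rm rms} \to 0$.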

In the framework considered here, the failure to rec- 
ognize genuinely random sequences arises not as a lim- 
itation but rather as an inevitable consequence of the 
optimal search for weak dynamic biases. Important tests 
of this idea thus include generalizations of the experiments in Refs [?] that allow detailed comparisons of 
human strategies for bias estimation with the optimal 
strategy, especially the prediction that the optimal strat- 
egy shifts with the context defined by the distribution 
out of which the fluctuating biases are drawn. Although 
not a complete theory, it seems plausible that the ob- 
jective difficulty in inducing the concept of a truly fair 
coin found here could be related to other problems in 
reasoning about probability that usually are ascribed to 
subjective factors. 